Abstract

This research addresses three closely related problems. (1) Most current search
technology  is  based  on  a  popularity  metric  (e.g.,  PageRank  or  ExpertRank),  but  not  on  the 
semantic  content  of  the  document.  (2)  When  building  components  in  a  service-oriented 
architecture  (SOA),  developers  must  investigate  whether  components  that  meet  certain 
requirements  already  exist.  (3)  There  is  no  easy  way  for  writers  of  requirements  documents  to 
formally  specify  the  meaning  and  domain  of  their  requirements.  Our  goal  in  the  research 
presented here is to address these concerns by designing a search engine that searches over
the  “meanings”  of  requirements  documents.  In  this  paper,  we  present  the  current  state  of  the 
ReSEARCH  project. 

Keywords:  Semantic  Search,  Requirements,  Open  Architecture,  Information  Systems 
Technology 
1 Motivation


While  modern  computing  has  made  it  possible  to  access  enormous  amounts  of  information 
with  little  effort,  much  of  that  information  comes  without  any  indexing,  making  manual 
search  of  it  all  but  impossible.  The  science  of  information  retrieval  (IR)  attempts  to  correct 
for  this  by  extracting  information  from  a  collection  of  documents  based  upon  a  search  request, 
or  query.  While  the  field  of  IR  has  focused  a  great  deal  of  attention  on  how  the  form,  or 
syntax,  of  a  query  and  the  documents  in  the  collection  can  aid  the  process  of  extracting 
information,  it  has  paid  far  less  attention  to  the  meanings,  or  semantics,  of  those  forms. 

Semantic analysis can be computationally intensive, and for certain domains, sensitivity
to meaning may not provide a system with sufficient improvement to justify the greater
computational cost incurred. However, there are at least two conditions in which semantically
sensitive search can lead to improvements over keyword-based approaches: a) when the
document  collection  is  composed  of  human-generated  free-text,  and  b)  when  the  document 
collection  is  in  a  specialized  domain,  with  non-standard  terminology  and  assumptions  (where 
the  standard  for  most  IR  is  the  general  content  of  the  World  Wide  Web). 

The Software, Hardware Asset Reuse Enterprise (SHARE) repository card catalog is
in the intersection of both conditions mentioned above. The SHARE card catalog should
ideally allow a user to search for an asset based upon free-text overviews generated during
asset submission, as well as additional structured metadata (Johnson & Blais, 2008). Because
this overview is written in free text, the syntactic form in which its information is expressed
cannot be guaranteed in advance, making search over it quite difficult.
In  addition,  the  elements  being  searched  over  are  descriptions  of  military  assets.  So,  the 
document  collection  for  this  IR  task  is  in  a  specialized  domain,  and  the  search  process  should 
be  sensitive  to  the  semantic  connections  that  are  particular  to  this  domain. 

In  order  to  appreciate  the  challenges  posed  by  IR  over  free-text  and  in  specialized  domains, 
we  now  turn  to  the  complications  that  each  condition  brings  to  the  task. 

1.1 Challenges of Free-Text Search

Human  language  in  general  has  several  properties  that  make  information  retrieval  taxing. 
Formally speaking, any language, human or man-made, can be expressed as a relation between
form (syntax) and meaning (semantics); thus fluency in a domain consists of knowing the
relation between the forms of the language and their corresponding meanings. Man-made
languages often aim to make this relation as straightforward as possible. For instance, in
the mathematical language of arithmetic the syntactic symbol “+” stands for the semantic
concept of numerical addition. However, note that the symbol “-” can stand for two different
semantic concepts: numerical subtraction or the marking of negative numbers. Thus, “-” has
multiple meanings, and we say that it is polysemous. In arithmetic, only “-” is polysemous,
but  in  human  languages  polysemy  is  pervasive.  The  word  tank,  for  example,  has  multiple 
meanings  (or,  senses) — it  may  refer  to  weaponry  or  a  water  tank.  In  the  information-retrieval 
context,  polysemy  renders  a  query  (and  sentences  in  the  document  set)  ambiguous:  if  the 
user  is  searching  for  tank  specifications,  are  they  asking  about  water  tanks  or  weaponry? 

Polysemy  complicates  the  form-meaning  relation  by  having  multiple  possible  meanings  for 
a  given  word.  In  addition,  human  language  routinely  has  multiple  words  attached  to  a  given 
meaning.  We  call  these  synonyms.  For  example,  the  verb  consume  has  many  synonyms,  e.g., 
devour,  ingest,  eat.  If  a  user  enters  the  query  “What  type  of  fuel  does  an  F-22  consume?”, 
without  an  understanding  of  synonymy,  the  system  will  not  be  able  to  return  to  the  user, 
for  example,  a  document  containing  the  answer  “The  F-22A  Raptor  uses  JP-8”.  Hence, 
synonymy  complicates  the  information-retrieval  task  by  creating  the  (quite  likely)  possibility 
that  the  meaning  requested  by  the  user  is  expressed  in  a  different  form  than  the  one  the 
user used in the query. Synonymy may occur at all levels of linguistic form; for example, the
sentences “The F-22A uses JP-8” and “JP-8 is the fuel type for the F-22A Raptor” convey
the same semantic information despite their rather different forms. In particular, the first
sentence lacks a synonym for fuel, meaning that sentence-level synonymy cannot simply be
the product of word-level synonymy.

One final complication of searching over human language is that the relationships between
semantic entities are not necessarily represented in the syntactic forms of the entities. For
instance, the semantic entities mother and daughter are connected by a parental relation. In
order to determine the sentential synonymy of Mary is Jane’s mother and Jane is Mary’s
daughter, the system must understand the relationship between mother and daughter.
This is a rather challenging task if we are simply looking at linguistic form, as there is nothing
in the words mother and daughter that indicates they are connected. Such information is
accessible only once we have some representation of the meanings of the words (or larger
elements), and some way of deriving inferences between them.

1.2  Challenges  of  Domain-dependent  Search 

As  detailed  in  (Johnson  &  Blais,  2008),  the  SHARE  repository  asset  library  currently  consists 
of  combat  systems  software  and  supporting  artifacts,  but  will  become  more  diverse  (e.g., 
through  the  incorporation  of  hardware  components).  The  card  catalog  will  thus  contain 
information  about  the  specification  and  function  of  such  artifacts.  As  Johnson  and  Blais 
note,  there  is  (and  will  continue  to  be)  a  high  level  of  similarity  between  the  SHARE  artifacts, 
given  that  they  are  all  specifiable  under  the  Surface  Navy  OA  Warfare  Systems  Architecture 
Element Level Decomposition. Hence, their overviews will share many characteristics atypical
of  documents  found  on  the  Web,  making  Web-based  tools  sub-optimal. 
However, the specialized nature of SHARE's domain could prove an advantage for semantically
sensitive search. For instance, it could allow for reasonably robust polysemy control. For
example, in the SHARE context, a query involving consume is more likely to refer to fuel
usage than to eating. Similarly, the domain could aid semantic inferencing (of the sort
exemplified by the pair mother-daughter) based both on terms in the free-text overview and
the larger functional context of a particular asset. Hence, based on facts regarding the objects
under discussion within SHARE, the system could conclude that there is a relation between
ballistics and shell-size, allowing searches regarding one to consider documents containing the
other. Additionally, building on the product lines to which assets belong in the Navy enterprise,
the system could infer that a given asset possesses certain properties that may be useful to the
user.


1.3 Domain-independent Learning for Domain-dependent Rules

Given that the polysemy and inferencing subsystems we are building are particular to the
specialized domain of SHARE, one natural question is how such subsystems will be developed.
One possibility is to build a handwritten set of rules, such as that implemented in the Wordnet
project, and have the IR system consult those rules when performing inferencing.
While  such  strategies  are  undoubtedly  useful,  they  typically:  a)  are  time-consuming,  b)  lack 
empirical  coverage  (human  error  may  cause  a  rule  to  go  unnoted),  and  c)  require  constant 
supervision  for  a  dynamic  document  collection.  All  three  pitfalls  are  of  concern  with  regard  to 
SHARE;  the  most  troubling  is  probably  the  requirement  for  constant  maintenance,  given  that 
SHARE  is  an  evolving  repository  and  a  potential  model  for  similarly  constrained  repositories 
over  different  kinds  of  assets. 

Given  such  problems,  we  propose  that  the  domain-dependent  components  of  ReSEARCH 
be  generated  not  by  human  input  but  by  machine  learning  over  the  document  collection  of 
SHARE  and,  in  the  initial  stages,  additional  informational  resources.  The  goal  of  ReSEARCH 
is  to  develop  a  system  of  tools  for  determining  domain-dependent  resources  to  address  the 
issues  surrounding  polysemy,  synonymy  and  inference. 

The  remainder  of  this  paper  will  detail  the  problems  of  contemporary  approaches  to  IR 
and  our  investigations  of  approaches  to  integrate  semantically  sensitive  tools.  In  the  following 
section, we will present an overview of common approaches to IR, and why they will fail in
dealing  with  collections  such  as  SHARE.  We  will  then  discuss  the  algorithmic  issues  involved 
in  generating  tools  that  allow  semantically  sensitive  searching.  Finally,  we  will  present  the 
current  status  of  our  implementation  of  the  ReSEARCH  system. 

1 Wordnet is an ongoing project directed by George Miller at Princeton University’s Cognitive Science
Laboratory  to  encode  relations  between  semantic  entities.  It  may  be  accessed  at  http://wordnet.princeton.edu. 
2 Prior and Current Strategies

The  Web  is  a  tremendously  useful  repository  of  information.  Unfortunately,  this  information 
is  unstructured,  and  there  is  no  canonical  “Table  of  Contents”  or  “Index,”  making  web  search 
one  of  the  most  challenging  of  today’s  Internet  problems.  Two  attempts  were  made  to  address 
this  challenge:  (1)  hand-classified  directories  (as  originally  used  by  Yahoo,  for  example),  and 
(2)  query-based  search  engines  (for  example,  AltaVista  and,  eventually,  Google).  This  second 
class  is  what  concerns  us  here.  For  more  details  on  Web  Search  Engines  see  (Schwartz,  1998). 

Search engines employ a centralized architecture in which so-called “spiders” collect website
information, and an indexer makes an index of these pages to ease the search. In the early
1990s,  the  first  phase  of  web  search  was  simply  keyword  search.  In  keyword-based  search, 
all  pages  containing  requested  keywords  are  returned,  ranked  according  to  the  strength  of 
match  (e.g.,  the  number  of  times  a  word  appears  in  a  document,  if  it  is  in  the  title,  etc.). 
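
The keyword-based strategy described above can be illustrated with a minimal sketch. It assumes that "strength of match" is simply the total number of keyword occurrences; the function name and the toy documents are hypothetical and are not taken from any particular search engine.

```python
from collections import Counter

def keyword_search(query, documents):
    """Return (doc_id, score) pairs for documents containing every keyword.

    The score is the total number of keyword occurrences, an assumed
    stand-in for "strength of match".
    """
    keywords = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        if all(counts[k] > 0 for k in keywords):
            results.append((doc_id, sum(counts[k] for k in keywords)))
    # Strongest match first; ties are broken by document id.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical toy collection.
docs = {
    "a": "tank specifications for the water tank",
    "b": "fuel tank specifications",
    "c": "aircraft fuel consumption",
}
ranked = keyword_search("tank specifications", docs)
```

Note that document "a" ranks first purely because tank appears twice; nothing in the score distinguishes a water tank from weaponry.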

AltaVista  used  this  strategy  originally.  In  1995,  it  was  the  first  company  to  fully  index 
the  visible  pages  on  the  World  Wide  Web.  Over  time,  it  evolved  different  search  modes: 
basic  search,  advanced  search,  and  power  search  (Notess,  n.d.).  One  advanced  feature  that 
AltaVista  and  other  search  engines  added  was  stemming  (Sapp,  2000).  Stemming  ensures 
that  words  with  plurals  and  suffixes  (e.g.,  -ed,  -ing,  -er)  are  always  treated  as  being  in  their 
stem form (Hersh, 2003, p. 178). Unfortunately, it is unclear how useful stemming is in the
search  process  (Harman,  1991). 
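
As a rough illustration of the idea, the following sketch strips a few common suffixes. It is a deliberately naive stand-in, not the stemmer used by AltaVista or any other engine; the suffix list and the minimum stem length are assumptions made for this example.

```python
# Hypothetical suffix list, checked in order; real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ("searching", "searched", "searcher", "tanks")]
```

The same rules reduce water to "wat", which shows how crude suffix stripping can conflate or mangle unrelated words.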

The second phase in Web search was the development of techniques that used the connection
between pages to create a ranking of the websites for more accurate search. The indexing
problem was changed into finding the most appropriate way to “rank” each website. One
easy solution was to make the rank proportional only to the number of other pages linking
to the page in question. However, this ranking method turned out to be inaccurate for a
variety of reasons. In particular, it did not take into account the source of the links, allowing
someone to easily boost the rank of a page by increasing the number of incoming links, thus
subverting the indexing mechanism (Langville & Meyer, 2006).

To avoid this index subversion, new methods needed to be developed which took advantage
of both the link structure of the Web and the meaning of the queried words, so that the output
was most relevant to the query. The challenge, then, was to increase the relevance of the returned
pages to the query itself.

2.1 PageRank and ExpertRank

In 1998, Google revolutionized search. They did this not by changing the fundamentals, as
the pages returned are still those that match the keywords in the query, but by changing the
order in which the returned pages were presented. Google ranked all pages according to a
then-novel ranking algorithm called PageRank (Brin & Page, 1998).

The  essence  of  the  Google  innovation  is  in  how  the  PageRank  algorithm  works.  The  rank 
of  each  page  in  a  search  depends  not  only  on  the  number  of  pages  pointing  to  it,  but  also 
on  the  rank  and  the  number  of  outgoing  links  of  these  pages.  To  further  determine  the  rank 
of  all  web  pages,  Google  simulates  the  behavior  of  virtual  surfers  randomly  surfing  the  web. 

A  page’s  rank  is  then  updated  based  on  how  frequently  the  random  surfers  visit  that  page. 
This  pre-existing  rank  of  each  individual  website  is  assigned  independently  of  any  query.  As 
a  result  of  this  ranking,  the  pages  are  ranked  in  order  of  sociological  importance:  the  more 
links with higher weight there are to a page, the more important it is in the “society” of
pages. Additionally, hubs, pages that have a lot of links pointing to them, are given greater
authority. In other words, the importance of a link is determined by both the rank of the
linking page and the number of outgoing links from that page. One pitfall of this scheme
(which Google attempts to correct) is that communities of websites can trap random surfers,
which, in turn, increases the rank of those websites.

As an example of how this is implemented, let Q be a query, a list of words, that is
saved as a vector $q$ whose binary components show whether a particular word is present
or not in Q. Also, the information on each website $R_1, R_2, \ldots$ is similarly saved as the
binary vectors $r_1, r_2, \ldots$, respectively. By computing the inner product $\langle q, r_i \rangle$, $i \geq 1$, or the
cosine similarity measure using the normalized vectors corresponding to the vectors above,
the system can identify the similarity between the query and the pages in the universe.
However, this method does not take into account the correlation between websites and their
semantics. To overcome this problem, Google uses the PageRank metric $PR(P)$, which
recursively defines the rank/importance of each page $P$ by


$$PR(P) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)},$$

where $d$ is a damping factor (0.85, as used by Brin and Page in (Brin & Page, 1998)), $T_i$
($1 \leq i \leq n$) are all the pages pointing to $P$, and each $T_i$ has $C(T_i)$ outgoing links. So,
$P$ receives a fraction of the weight $PR(T_i)$, as this weight is equally spread among all the
outgoing links from $T_i$, for $1 \leq i \leq n$ (Zhang & Dong, 2000).
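
The two computations just described can be sketched as follows. This is a minimal illustration, not Google's implementation: the damping value d = 0.85 is a parameter choice of this sketch, the fixed iteration count and the three-page toy web are hypothetical, and every page is assumed to have at least one outgoing link.

```python
import math

def cosine_similarity(q, r):
    """Cosine of the angle between two equal-length binary vectors."""
    dot = sum(a * b for a, b in zip(q, r))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in r))
    return dot / norm if norm else 0.0

def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T_i) / C(T_i)).

    `links` maps each page to the list of pages it links to. Every page
    is assumed to have at least one outgoing link.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # T_i: the pages that point to `page`; C(T_i) = len(links[T_i]).
            incoming = [t for t in links if page in links[t]]
            new_ranks[page] = (1 - d) + d * sum(
                ranks[t] / len(links[t]) for t in incoming
            )
        ranks = new_ranks
    return ranks

# Hypothetical toy web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
```

In this toy web, C ends up with the highest rank because it receives all of B's weight and half of A's, while B receives only half of A's weight.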

Ask.com,  formerly  known  as  “Ask  Jeeves,”  is  another  search  site  offering  state-of-the-art 
search,  this  time  based  on  technology  called  “ExpertRank”  (Ask.com,  n.d.).  In  addition 
to examining the number of links entering a site, ExpertRank also attempts to identify
topic  clusters  related  to  a  search,  as  well  as  experts  within  these  topics,  and  use  all  of  this 
information  to  rank  search  results. 


2.2  The  State  of  Online  Search  Using  Natural  Language  Processing 

Since  the  “Semantic  Web”  has  become  a  buzzword  in  the  Internet  community  and  in  business 
at  large,  several  organizations  have  emerged  to  provide  “Semantic  Search.”  Many  promising 
companies  and  research  projects  have  built  search  systems  that  crawl  the  web  for  annotated 
data  over  which  to  search,  such  as  web  sites  with  RDF  data.  This  search  strategy,  however, 
does not allow searching documents that do not have rich, hand-built metadata. In particular,
the vast majority of documents online, written in natural human language, are not
searched. A small subset of these search engines, however, have begun tackling the problem
of searching documents consisting only of written language, extracting semantic meaning.

Powerset Labs (www.Powerset.com), a San Francisco-based startup, has positioned itself
as a forerunner in this field by attempting to leverage natural language processing in
its search system. Currently honing its search algorithm, Powerset indexes and searches
Wikipedia for question-answering tasks. The documents in this database are written in plain
text and, for the purposes of search, do not contain extended metadata. Instead, the Powerset
indexing algorithm identifies linguistic features such as named entities and parts of speech
to improve search results.

Because Powerset is a private, for-profit company, its search algorithm is not public, but some
important functionality can be inferred from public demonstrations. The Powerset Labs
website currently contains two methods of searching Wikipedia. The first is a general search
of the document index, which encourages queries to be phrased as questions. Queries such as
“When did earthquakes hit San Francisco?” and “politicians from Virginia” are among the
suggested queries. These queries return results that demonstrate term matching
on a higher level than keyword search. For example, the Powerset system uses “When” as
a wildcard to match dates and times that appear in phrases describing earthquakes in San
Francisco. “From” is used, in the second example, to search for phrases that indicate some
named entity is “from” Virginia. This improves results significantly over a search with just
the keywords “politicians” and “Virginia,” as are used in standard search engines.

The search “politicians from Virginia” also reveals that “politicians” matches terms such
as “governor” and “senator”, indicating that an ontology is used to match the term “politician”
with its hyponym, “governor.” The search “What do zombies eat?” reveals that the
Powerset algorithm also searches over synonyms by returning results containing the synonymous
verb “devour.” This system does not perform rich disambiguation, however, as
evidenced by the result “... zombie finishes college,” in which “finishes” is considered a synonym
of “eat”.

Finally, results from the Powerset search “What do zombies eat?” include phrases in which
the information about what zombies eat is encoded in more complex sentence structures. Correct
results such as “granddaughter eaten by zombies,” “zombies ... where they are brought
back from the dead by supernatural or scientific means, eat the flesh or brains of the living”,
and “His corpse is thrown over the fence to be devoured by the zombies”, all reveal that
powerful parsing of the sentences is performed in the indexing process rather than strictly
requiring matching phrases such as “zombies eat *”. Though their indexing structure is not
known, the “PowerMouse” demonstration allows the user to search the fact index more directly,
confirming that these relationships exist in the indexing for fast searching, eliminating
the need for computationally expensive parsing with every search query.

Powerset  is  thus  building  capabilities  for  semantically  sensitive  search  similar  to  those 
of ReSEARCH. However, it is not clear that Powerset’s approach is designed to handle the
domain-specificity  of  collections  like  SHARE,  meaning  that  it  is  not  clear  their  technology 
can  be  leveraged  to  construct  novel  inferencing  mechanisms  in  particular  domains. 


3  Automated  Inference-rule  Discovery 

Recall  that  natural  languages,  unlike  formal  taxonomic  structures,  contain  inherent  ambiguity 
of  both  form  and  meaning.  It  is  this  ambiguity  that  presents  a  challenge  for  natural  language 
applications  such  as  information  retrieval  or  question  answering.  Two  questions  arise:  1) 
which  meaning  of  a  word  or  phrase  in  a  search  term  does  the  requester  intend,  and  2)  how 
do  we  return  results  that  are  related  to  the  search  query,  even  if  the  search  term  does  not 
contain the exact word or words? The first question is related to the problem of word sense
disambiguation and is, itself, a well-studied area. We shall turn our attention to the second
problem:  inference. 
3.1 Semantic Similarity from Distributional Similarity

In (Hearst, 1992), Hearst explored using one kind of inference rule to generate others, given
a body of text. Specifically, she considered how synonymy relations could be used
to learn the relation of hypernymy, or subtype classification. For concreteness, consider the
pair vehicle-Humvee. As Humvee is a subtype of vehicle, the latter is a hypernym of the former.
How could a machine learn the hypernymy relation of vehicle-Humvee automatically?
Hearst’s approach exploited the fact that the co-occurrence of words in patterns of the type X
such as Y, as well as its synonyms X, including Y and Y and other X, implies a hypernymic
relationship between X and Y. As she demonstrated, if a system were seeded with various
synonyms for forms that demonstrate hypernymy, the system could induce hypernymic
connections from the text provided.

Figure 2. Dependency Tree of Same Sentence as in Figure 1
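
The pattern-based approach attributed to Hearst above can be sketched with regular expressions. This is only an illustration under strong assumptions: X and Y are matched as single words rather than noun phrases, only the three patterns named in the text are used, and the example sentences are invented.

```python
import re

# Each pattern captures two single words; "XY" means hypernym first,
# "YX" means hyponym first. Matching single tokens is an assumption.
PATTERNS = [
    (re.compile(r"(\w+),? such as (\w+)"), "XY"),
    (re.compile(r"(\w+),? including (\w+)"), "XY"),
    (re.compile(r"(\w+),? and other (\w+)"), "YX"),
]

def extract_hypernym_pairs(text):
    """Return a set of (hypernym, hyponym) pairs found in `text`."""
    pairs = set()
    for pattern, order in PATTERNS:
        for first, second in pattern.findall(text):
            pairs.add((first, second) if order == "XY" else (second, first))
    return pairs

# Hypothetical example text.
text = ("The convoy used vehicles such as Humvees. "
        "Humvees and other vehicles were inspected.")
pairs = extract_hypernym_pairs(text)
```

Both sentences yield the same pair, which is the sense in which the patterns act as synonyms of one another for expressing hypernymy.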

While Hearst's method is useful for learning various inferences, it relies upon human-generated
synonyms for expression of hypernymy (or the relation in question). More desirable
would be a system that learns the synonyms themselves from the text, especially given the
possibility that such synonyms could be domain-dependent. In their 2001 study, “DIRT -
Discovery of Inference Rules from Text” (Lin & Pantel, 2001), Lin and Pantel outline an
unsupervised method of discovering inference rules from text, based on the idea that semantic
similarity is generally correlated with syntactic similarity. We turn to this next.
3.2 Dependency Trees and Paths


A  dependency  relationship  is  an  asymmetric  binary  relationship  between  two  words:  a  head 
and a modifier. One can observe the structure of a sentence by examining the tree formed
of the dependency relationships contained therein. The tree structure arises from the characteristic
that a given word may have more than one modifier, but each word may modify
a maximum of one word. Note that a dependency tree differs from a parse tree, which is
concerned with the syntactic relationship between words. A comparison of the two is shown
in Figures 1 and 2.

Dependency graphs are constructed by using Lin's MINIPAR, a broad-coverage English
language dependency parser (Lin, 2008). Links in the graph represent indirect semantic
relationships between two words. A dependency path is constructed by joining the words
and their link dependency relationships, excluding the two end words. For instance, in
our example sentence the dependency path between the words ship and rudder would be
represented by the path N:subj:V←lacks→V:obj:N. The words ship and rudder fill the slots
in the path at either end. Non-slot dependency relations are called internal relations. In this
manner, one can construct the paths of all word pairs in a given corpus of text.
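
The construction of a dependency path can be sketched as a walk through a small hand-built dependency tree. The node format, the sentence "The ship lacks a rudder", and the function name are assumptions made for illustration; MINIPAR's actual output format is not reproduced here, and the sketch assumes that neither end word is the head of the other.

```python
def dependency_path(nodes, start, end):
    """Return the dependency path between nodes[start] and nodes[end].

    Each node is a dict with keys "word", "pos", "head" (index of the head
    word, or None for the root) and "rel" (relation to the head).
    """
    def chain_to_root(i):
        chain = [i]
        while nodes[chain[-1]]["head"] is not None:
            chain.append(nodes[chain[-1]]["head"])
        return chain

    up, down = chain_to_root(start), chain_to_root(end)
    common = next(i for i in up if i in down)  # lowest common head
    if common in (start, end):
        raise ValueError("sketch assumes neither end word dominates the other")
    up = up[: up.index(common) + 1]
    down = down[: down.index(common) + 1][::-1]

    def label(left, right, modifier):
        return f"{nodes[left]['pos']}:{nodes[modifier]['rel']}:{nodes[right]['pos']}"

    path = ""
    for child, head in zip(up, up[1:]):      # climb from start to the common head
        path += ("←" if path else "") + label(child, head, child)
        path += "←" + nodes[head]["word"]
    for head, child in zip(down, down[1:]):  # descend from the common head to end
        path += "→" + label(head, child, child)
        if child != end:                     # end words are slots, not part of the path
            path += "→" + nodes[child]["word"]
    return path

# Hypothetical parse of "The ship lacks a rudder".
sentence = [
    {"word": "the", "pos": "Det", "head": 1, "rel": "det"},
    {"word": "ship", "pos": "N", "head": 2, "rel": "subj"},
    {"word": "lacks", "pos": "V", "head": None, "rel": None},
    {"word": "a", "pos": "Det", "head": 4, "rel": "det"},
    {"word": "rudder", "pos": "N", "head": 2, "rel": "obj"},
]
path = dependency_path(sentence, 1, 4)
```

The determiners never appear in the result because they do not lie on the chain of links connecting the two slot words.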

Lin  and  Pantel  (Lin  &  Pantel,  2001)  imposed  a  set  of  constraints  on  the  paths  to  be 
extracted: 

•  The  "slot  fillers"  must  be  nouns,  since  these  are  variables  that  will  be  instantiated  by 
entities. 

•  Dependency  relations  that  do  not  connect  the  two  content  words  (e.g.,  in  the  case  of 
determiners  or  modifiers),  will  be  excluded  from  the  path. 

•  There  will  be  a  lower  limit  (threshold)  on  the  frequency  count  of  an  internal  relation. 

To accumulate the frequency counts of paths in a corpus, a triple database was used.
A triple has the form (p, Slot, word). For a path p connecting two words $w_1$ and $w_2$, each such
pair of words has two corresponding triples: (p, SlotX, $w_1$) and (p, SlotY, $w_2$). (SlotX, $w_1$)
and (SlotY, $w_2$) are features of path p.
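
A triple database of this kind can be approximated in memory with a counter keyed by (path, slot, word). The sketch below is an assumption about one reasonable representation, not a description of Lin and Pantel's storage format; the path instances are invented.

```python
from collections import Counter

def count_triples(path_instances):
    """Accumulate (path, slot, word) frequency counts.

    `path_instances` is an iterable of (path, x_filler, y_filler) tuples,
    one per occurrence of a path in the corpus.
    """
    triples = Counter()
    for path, x_word, y_word in path_instances:
        triples[(path, "SlotX", x_word)] += 1
        triples[(path, "SlotY", y_word)] += 1
    return triples

# Hypothetical path occurrences.
LACKS = "N:subj:V←lacks→V:obj:N"
instances = [
    (LACKS, "ship", "rudder"),
    (LACKS, "ship", "anchor"),
    (LACKS, "boat", "rudder"),
]
triples = count_triples(instances)
```

Each occurrence of a path contributes exactly two triples, one per slot, so three occurrences yield a total count of six.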

3.3  Path  Similarity 

As alluded to above, Lin and Pantel's approach makes an assumption based on Harris’s
Distributional Hypothesis (Harris, 1954), which assumes that two words will have a similar
meaning if they appear in similar contexts. Lin and Pantel assume that the
hypothesis holds not only for words but also for paths between words; i.e., if multiple
dependency tree paths link the same set of words, then the meanings of the paths are likely
similar. They termed this the Extended Distributional Hypothesis.

Computing similarity between two paths first takes into account the mutual information
between a path slot and its filler. The approach is similar to calculating a tf-idf (term
frequency × inverse document frequency) measurement and is performed for a similar reason:
to discount high-frequency words that may not have the same importance as less frequent
words. Lin and Pantel's formula leverages the similarity measurement proposed in (Lin,
1998), but is modified to take paths into account:
$$mi(p, Slot, w) = \log \frac{|p, Slot, w| \times |*, Slot, *|}{|p, Slot, *| \times |*, Slot, w|}.$$


With the mutual information thus defined, the similarity between a pair of slots,
$slot_1 = (p_1, s)$ and $slot_2 = (p_2, s)$, is defined as:

$$sim(slot_1, slot_2) = \frac{\sum_{w \in T(p_1, s) \cap T(p_2, s)} \left( mi(p_1, s, w) + mi(p_2, s, w) \right)}{\sum_{w \in T(p_1, s)} mi(p_1, s, w) + \sum_{w \in T(p_2, s)} mi(p_2, s, w)}.$$

In this formula, $p_1$ and $p_2$ are paths, $s$ is a slot, and $T(p_i, s)$ is the set of all words that fill
the $s$ slot of path $p_i$. Finally, the similarity of two paths $p_1$ and $p_2$ is defined by the geometric
average of the similarities of their SlotX and SlotY slots:

$$S(p_1, p_2) = \sqrt{sim(SlotX_1, SlotX_2) \times sim(SlotY_1, SlotY_2)}.$$
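
The three formulas above can be traced on a toy triple database. The sketch below is a direct transcription under stated assumptions: triples are held in a counter keyed by (path, slot, word), a wildcard is written as None, paths are written in surface form for readability, the data are invented, and no filtering or smoothing is applied. It also assumes the slot similarities are non-negative, which holds for this toy data but is not guaranteed in general.

```python
import math
from collections import Counter

def count(triples, p=None, s=None, w=None):
    """Sum the frequencies of triples matching a pattern; None is the * wildcard."""
    return sum(
        c for (p2, s2, w2), c in triples.items()
        if p in (None, p2) and s in (None, s2) and w in (None, w2)
    )

def mi(triples, p, s, w):
    """mi(p, s, w) = log(|p,s,w| * |*,s,*| / (|p,s,*| * |*,s,w|))."""
    joint = count(triples, p, s, w)
    if joint == 0:
        return 0.0
    return math.log(
        joint * count(triples, s=s) / (count(triples, p, s) * count(triples, s=s, w=w))
    )

def fillers(triples, p, s):
    """T(p, s): the set of words filling slot s of path p."""
    return {w2 for (p2, s2, w2), c in triples.items() if p2 == p and s2 == s and c > 0}

def slot_similarity(triples, p1, p2, s):
    t1, t2 = fillers(triples, p1, s), fillers(triples, p2, s)
    num = sum(mi(triples, p1, s, w) + mi(triples, p2, s, w) for w in t1 & t2)
    den = sum(mi(triples, p1, s, w) for w in t1) + sum(mi(triples, p2, s, w) for w in t2)
    return num / den if den else 0.0

def path_similarity(triples, p1, p2):
    """Geometric average of the SlotX and SlotY similarities."""
    return math.sqrt(
        slot_similarity(triples, p1, p2, "SlotX")
        * slot_similarity(triples, p1, p2, "SlotY")
    )

# Hypothetical corpus: two paths share all fillers, a third shares none.
triples = Counter()
for path, x, y in [
    ("X manufactures Y", "Peugeot", "car"),
    ("X manufactures Y", "Boeing", "aircraft"),
    ("X's Y factory", "Peugeot", "car"),
    ("X's Y factory", "Boeing", "aircraft"),
    ("X asks Y", "judge", "witness"),
]:
    triples[(path, "SlotX", x)] += 1
    triples[(path, "SlotY", y)] += 1

same = path_similarity(triples, "X manufactures Y", "X's Y factory")
different = path_similarity(triples, "X manufactures Y", "X asks Y")
```

Paths whose slots are filled by exactly the same words score 1, and paths with no fillers in common score 0; real corpora fall between these extremes.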

Comparison of paths in a corpus is accomplished via pairwise comparison of each path
using the preceding formulae. Since comparison of all paths is computationally expensive,
Lin and Pantel use a filtering algorithm that only compares paths where a candidate path's
shared features with an input path p exceed a fixed percentage. This procedure ultimately
produces a list of paths in descending order of their similarity to p.

3.4  Results 

Lin and Pantel (2001) used MINIPAR to parse approximately 1GB of newspaper text from
the AP Newswire, San Jose Mercury-News, and The Wall Street Journal. From this, they
extracted seven million paths, 231,000 of them unique, which were then stored in a triple
database. For evaluation, they used the first six questions of the TREC-8 Question-Answering
Track, extracted the paths from the questions, and generated a Top-40 Most Similar list using
their algorithm to determine if the generated paths might contain the answer to the questions
posed. This output was also compared to a set of publicly available, manually generated
paraphrases of the TREC questions. In the evaluation, a path was deemed to be correct if
it was likely that the path could generate the correct response to the question, given that
the answer could be found in some corpus. An example used by Lin and Pantel (2001) was
the path “X manufactures Y” generated from the TREC question, “What does the Peugeot
company manufacture?” One of the Top-40 most similar paths is “X's Y factory.” Since
“Peugeot's car factory” is a likely phrase in some corpus, this generated path is classified as
correct.

The performance of the DIRT algorithm varied widely across paths. Paths with verb roots
tended to perform better than paths with noun roots, since noun-root paths tend to occur
less often. Lin and Pantel (2001) also found that, even with high-scoring correct paths, there
was little overlap between these automatically generated paths and the manually generated
paraphrases, suggesting how difficult the paraphrase-generation task is for humans. As noted
earlier in studies of manual inference-rule generation, completeness errors exist due to the
difficulty of paraphrase recall for humans. In this capacity, the DIRT algorithm shows promise
in augmenting a manual-generation workflow.

Q#   Paths                          Man.   DIRT   Int.   Acc.
Q1   X is author of Y                  7     21      2   52.5%
Q2   X is monetary value of Y          6      0      0   N/A
Q3   X manufactures Y                 13     37      4   92.5%
Q4   X spend Y                         7     16      2   40.0%
     spend X on Y                      8     15      3   37.5%
Q5   X is managing director of Y       5     14      1   35.0%
Q6   X asks Y                          2     23      0   57.5%
     asks X for Y                      2     14      0   35.0%
     X asks for Y                      3     21      3   52.5%

Table  1.  A  Summary  of  Lin  and  Pantel's  DIRT  Algorithm  Results  on  TREC-8  Questions. 


A summary of the DIRT results on the TREC data is given in Table 1. The column labeled
“Man.” indicates the number of manual paraphrases generated for the question. The next
column (“DIRT”) shows the number of paths found by the DIRT algorithm. The intersection
of those two sets (“Int.”) is in the fifth column. The final column (“Acc.”) shows the evaluated
accuracy of the automatically generated paths.
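
The accuracy values in Table 1 appear to be the DIRT count divided by the 40 entries of the Top-40 list (for example, 21/40 = 52.5%). The short sketch below recomputes the column under that assumption; the variable and function names are illustrative only.

```python
TOP_N = 40  # length of the Top-40 Most Similar list

# DIRT column of Table 1, keyed by path (the path with a count of 0 is omitted).
dirt_counts = {
    "X is author of Y": 21,
    "X manufactures Y": 37,
    "X spend Y": 16,
    "spend X on Y": 15,
    "X is managing director of Y": 14,
    "X asks Y": 23,
    "asks X for Y": 14,
    "X asks for Y": 21,
}


def accuracy_percent(count, top_n=TOP_N):
    """Share of the Top-N list represented by `count`, in percent."""
    return 100.0 * count / top_n


accuracies = {path: accuracy_percent(c) for path, c in dirt_counts.items()}
```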

3.5  Related  work 

Snow et al. (Snow, Jurafsky, & Ng, 2005) leverage a similar method of automated
inference-rule discovery using dependency paths, in a continuation of the hypernym discovery
method pioneered by Hearst. This method involved using the dependency paths in a feature
count vector and conducting a binary classification of hypernymy for word pairs based on
vector-distance measurement. The results obtained represented a 16% F-score improvement
over previous models, and a 40% improvement when augmented with coordinate terms (i.e.,
terms that share a common hypernym ancestor).


4  Implementation  issues 

Lucene Java is an open-source project, available under the Apache License, which provides
an accessible API for the development of search applications. Lucene provides plenty of
opportunities to construct a semantic search engine. A good overview and documentation are
available from the Apache Lucene website (Lucene-java Wiki, 2008).

A search application developed with Lucene consists of the same two major components
mentioned in Section 2: an indexer and a searcher. The indexer builds an index of the given
documents; the structure and content of this index depend on the implementation of the
indexer application. Typical contents would be the title of a document, its path, a URL, or
the actual text content. Content can be stored in different ways, depending on whether it has
to be searchable or not. The search application typically converts a search string given by the
user into a query and then searches the index for matching items. Later in this section, two
short examples will demonstrate these processes.


4.1  Interesting  Features  of  the  Lucene  API 

One remarkable property of Lucene is its flexibility. By overriding the stemming and
analyzing algorithms, its behavior can be changed into something completely new, in particular
from a keyword search engine into a semantic search engine similar to Powerset. Another
very useful property of Lucene is its accessibility from different environments, e.g., Python.

PyLucene is a Python extension for accessing Java Lucene. This extension allows
developers to implement some functionality of the desired application using NLTK, a widely used
Python-based project for natural language processing. Documentation and implementation
samples for PyLucene can be found in Vajda (2005).

Another very helpful feature is a package for indexing and query expansion based on
WordNet synonyms. Using the WordNet application, this package creates a synonym index
of words and converts search strings into queries which can be used by Lucene. For our first
tests, we built an index of synonyms from WordNet and used it to expand and convert search
strings into Lucene-compatible queries.
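
The expansion step can be illustrated independently of Lucene. The sketch below is not the WordNet package itself: it assumes a plain dictionary of synonyms (here a hypothetical toy table) and produces a query string in Lucene's `field:term` syntax, in which space-separated clauses are optional terms.

```python
def expand_query(search_string, synonyms, field="text"):
    """Turn a plain search string into a Lucene-style query string.

    synonyms maps a lower-case word to an iterable of single-word synonyms;
    in the setup described above it would be read from the WordNet synonym index.
    """
    clauses = []
    for word in search_string.lower().split():
        for candidate in [word] + sorted(synonyms.get(word, ())):
            clause = "%s:%s" % (field, candidate)
            if clause not in clauses:
                clauses.append(clause)
    # Space-separated clauses are optional (OR) terms for Lucene's default query parser.
    return " ".join(clauses)


# Hypothetical synonym table, not actual WordNet output.
toy_synonyms = {"car": {"auto", "automobile"}}
expanded = expand_query("car engine", toy_synonyms)
```

Multi-word synonyms would need to be quoted as phrases before being added; the sketch leaves that case out.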


4.2  The  Wikipedia  Corpus 

For our experiments we decided to use downloadable Wikipedia content
(http://download.wikimedia.org/enwiki/20080312/enwiki-20080312-pages-articles.xml.bz2).

The size of this file is about 60 GB. A file of this size requires an event-based parser such
as SAX. For the first experiments, only about 160 MB (more than 12,000 articles) from a
partial download were used.

The structure of the XML file is as follows: every article is stored in a <page> node, which
has several child nodes. From these child nodes we used the <title> and <text> fields. The
special syntax of a Wikipedia page was ignored at first, meaning that all the content of an
article was given the same priority; in particular, we did not distinguish between headings,
links, or normal text.
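
An event-based pass over such a file might look as follows. This is a minimal sketch using Python's standard `xml.sax` module rather than the Java code used in our experiments; the `PageHandler` class, the callback, and the two-article sample are hypothetical.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Collects the <title> and <text> of every <page> without loading the whole file."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page      # callback invoked once per article
        self.in_page = False
        self.field = None           # "title" or "text" while inside that element
        self.parts = {}

    def startElement(self, name, attrs):
        if name == "page":
            self.in_page = True
            self.parts = {"title": [], "text": []}
        elif self.in_page and name in self.parts:
            self.field = name

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self.field is not None:
            self.parts[self.field].append(content)

    def endElement(self, name):
        if name == self.field:
            self.field = None
        elif name == "page":
            self.in_page = False
            self.on_page("".join(self.parts["title"]), "".join(self.parts["text"]))


# Tiny stand-in for the Wikipedia dump.
SAMPLE = b"""<mediawiki>
  <page><title>Alpha</title><text>First article.</text></page>
  <page><title>Beta</title><text>Second article.</text></page>
</mediawiki>"""

pages = []
xml.sax.parseString(SAMPLE, PageHandler(lambda title, text: pages.append((title, text))))
```

For the real dump, the same handler could be passed to `xml.sax.parse` together with a file object opened via `bz2.open`, so that the compressed file is streamed rather than unpacked first.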

Parsing and indexing 12,738 articles took about four minutes on a Windows Vista PC
with an AMD64 CPU and 1 GB memory under non-benchmark conditions.

4.3  Sample  Implementations 

Two  sample  implementations  will  be  introduced:  a  Wikipedia  indexer  and  a  small  search 
application. 

The indexer follows a sample given in (Schmidt, 2005). The original version had to be
changed in order to be compatible with the current version of Lucene. Only the main
concepts will be considered at this point; for further explanations of the different classes
involved, see (Lucene-java Wiki, 2008) and Lucene's Javadoc. The main part of an indexing
application is the index writer. It writes the index into a file system and also optimizes its
structure for faster access. Logically, the written index consists of documents; in our case,
every Wikipedia article is treated as a separate document. A document is then split into
different fields; for the sample application, these fields were “title” and “text.” The indexer
determines whether and how a field is stored in the index. The choices are: (1) not to store
it at all, (2) to store it, but not to index it, (3) to store it and to index it without first
analyzing it, and (4) to store it and to index it using an analyzer. An analyzer implements a
certain policy for extracting index terms from text. Lucene already implements various
analyzers; we used a StandardAnalyzer for our tests. Since an analyzer determines how the
content of a document is represented in the index, it provides an opportunity for the developer
to implement semantic strategies for building and searching the index. The Wikipedia indexer
first parses the XML file that contains the articles and extracts all <page> nodes, from which
the <title> and the <text> nodes are extracted. Then, for every article, two new fields are
generated: “title” and “text.” These fields are added to a new document, which is then
passed to the IndexWriter object. After this process, the index content is optimized by the
writer, concluding the indexing process.
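
The roles of documents, fields, and the analyzer can be made concrete with a toy index writer. The sketch below is plain Python and deliberately does not use the Lucene API: `ToyIndexWriter`, `simple_analyzer`, and the per-field flags are hypothetical stand-ins for the store/index/analyze choices listed above.

```python
import re
from collections import defaultdict


def simple_analyzer(text):
    """A stand-in analyzer: lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndexWriter:
    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.postings = defaultdict(set)   # (field, term) -> ids of matching documents
        self.stored = []                   # document id -> stored field values

    def add_document(self, fields):
        """fields maps a field name to (value, store, index, analyze)."""
        doc_id = len(self.stored)
        kept = {}
        for name, (value, store, index, analyze) in fields.items():
            if store:
                kept[name] = value         # retrievable with the search result
            if index:
                terms = self.analyzer(value) if analyze else [value]
                for term in terms:
                    self.postings[(name, term)].add(doc_id)
        self.stored.append(kept)
        return doc_id


writer = ToyIndexWriter(simple_analyzer)
doc_id = writer.add_document({
    "title": ("Edvard Munch", True, True, True),                      # stored and analyzed
    "text": ("Norwegian painter of The Scream", False, True, True),   # searchable, not stored
})
```

Replacing `simple_analyzer` with a function that emits semantic index terms is the kind of extension point the paragraph above refers to.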

The searcher was implemented in Python using PyLucene, again with a StandardAnalyzer.
To be able to search for an article, the user's search string has to be converted into a query.
This conversion is done by a Lucene class called QueryParser, which is constructed using the
name of the field that contains the actual content and the analyzer. The query is then passed
to the searcher, which returns an object called “hits.” This object holds a list of all matching
documents with an assigned score. For our purposes, the searcher application just prints the
titles of the matching articles, followed by their score.

The score is assigned by an object which extends the Scorer class in the API. The scorer
itself uses a similarity implementation which is based on the cosine distance between document
and query vectors in a Vector Space Model of Information Retrieval.
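
The underlying idea can be shown with raw term counts. The following sketch computes only the cosine of the angle between a query vector and a document vector; it is not Lucene's scoring formula, which applies additional weighting and normalization factors, and the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


same = cosine_similarity(["scream", "munch"], ["munch", "scream"])
partial = cosine_similarity(["scream", "munch"], ["munch", "painter", "norway", "oslo"])
unrelated = cosine_similarity(["scream"], ["relativity"])
```

A document that contains exactly the query terms scores (up to rounding) 1.0, a document that shares only some terms scores lower, and a document that shares none scores 0.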

4.4  WordNet  Query  Expansion  Sample 

Figure 5 shows the output of a standard keyword search using only a single word in the
search string versus the output of the same search after expanding the search with the
WordNet interface. Only results with a score higher than 0.5 were printed. This very simple
sample shows how using synonyms can improve a search significantly. Note that the Wikipedia
article “Relativity” does not appear, although it should do so with a score of 1.0. The
explanation is simple: the article is not in the corpus, because all experiments were run on
only about 12,000 articles, which is less than 1.7% of the actual corpus. A perfect hit from
a single keyword is therefore very unlikely.

[Screenshot: console output of the IndexWikipedia Java application, listing the title of every
hundredth article as it is indexed (from “9901: Gladstone Gander” to “12701: Politics of
Kyrgyzstan”), followed by “Articles: 12738”, “Optimizing ...”, and the elapsed time in seconds.]

Figure 3. Output of the Wikipedia Indexer

    >>> def searchWikipedia(queryString):
            # parser is the QueryParser and searcher the index searcher set up beforehand
            query = parser.parse(queryString)
            hits = searcher.search(query)
            print "Hits: ", hits.length()
            for i in range(0, hits.length()):
                doc = hits.doc(i)
                title = doc.get("title")
                print i, ": ", title, "score: ", hits.score(i)

    >>> searchWikipedia("scream AND munch")
    Hits:  6
    0 :  Edvard Munch score:  0.999999940395
    1 :  Afterglow score:  0.4743026793
    2 :  Angst score:  0.23715133965
    3 :  Fear score:  0.142290815711
    4 :  August 31 score:  0.118575669825
    5 :  August 22 score:  0.117710016668
    >>>

Figure 4. Searcher Output

    >>> searchWikipedia("Relativity")
    Hits:  106
    0 :  General Relativity score:  1.0
    >>> searchWikipedia("text:relativity text:einstein")
    Hits:  206
    0 :  Albert Einstein score:  1.0
    1 :  Inertial frame of reference score:  0.812765300274
    2 :  Gravitational redshift score:  0.801645994136
    3 :  General Relativity score:  0.737773377533
    4 :  General relativity score:  0.73440735403
    5 :  Acceleration score:  0.636109173293
    6 :  Arthur Stanley Eddington score:  0.603332340717
    7 :  Cosmic censorship hypothesis score:  0.578752375323
    8 :  Graviton score:  0.577327572323
    9 :  Einstein score:  0.574990303964
    10 :  Faster-than-light score:  0.571375691414
    >>>

Figure 5. Comparison of Results Using a Standard Versus an Augmented Search String

4.5  Open  Issues 

The  Lucene  API  provides  several  access  points  through  which  it  can  be  extended  to  a  semantic 
search  engine.  Future  work  will  determine  how  a  document  has  to  be  represented  in  an  index 
to  enable  a  semantic  search.  This  will  involve  implementation  of  an  analyzer  representing 
the  policy  for  extracting  index  terms  from  the  corpus.  In  order  to  match  queries  against 
documents,  the  analyzer  will  need  to  transform  search  strings  into  a  representation  compatible 
with  that  of  the  documents  in  the  index.  Additionally,  in  order  to  rank  documents  matching 
the  query  according  to  a  scale  of  relevance,  we  will  need  to  implement  a  semantically  sensitive 
scorer. 

A final issue of research is how to use WordNet for query expansion beyond addition of
synonyms. One relation between words that is worth considering is certainly the
hypernym-hyponym relation. WordNet already provides a definition for this relation as a Prolog file.
Therefore, a parser for the different WordNet files should be included in the implementation.
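
Such a parser could be quite small. The sketch below assumes the fact formats of the WordNet Prolog distribution, `s(synset_id,w_num,'word',ss_type,sense_number,tag_count).` for word senses and `hyp(synset_id,synset_id).` with the second synset being the hypernym of the first; the synset ids in the sample are made up, and the function names are hypothetical.

```python
import re
from collections import defaultdict

HYP_FACT = re.compile(r"^hyp\((\d+),(\d+)\)\.$")
S_FACT = re.compile(r"^s\((\d+),\d+,'((?:[^']|'')*)',[a-z],\d+,\d+\)\.$")


def parse_wordnet_prolog(s_lines, hyp_lines):
    """Read word-sense and hypernym facts into two lookup tables."""
    words = defaultdict(list)        # synset id -> words in that synset
    for line in s_lines:
        match = S_FACT.match(line.strip())
        if match:
            # Prolog escapes an apostrophe inside a quoted atom by doubling it.
            words[match.group(1)].append(match.group(2).replace("''", "'"))
    hypernyms = defaultdict(set)     # synset id -> ids of its hypernym synsets
    for line in hyp_lines:
        match = HYP_FACT.match(line.strip())
        if match:
            hypernyms[match.group(1)].add(match.group(2))
    return words, hypernyms


def hypernym_words(word, words, hypernyms):
    """All words found in a direct hypernym synset of any synset containing `word`."""
    result = set()
    for synset, members in words.items():
        if word in members:
            for parent in hypernyms.get(synset, ()):
                result.update(words.get(parent, ()))
    return result


# Illustrative facts with made-up synset ids.
sample_s = [
    "s(100000001,1,'car',n,1,0).",
    "s(100000001,2,'auto',n,1,0).",
    "s(100000002,1,'motor vehicle',n,1,0).",
]
sample_hyp = ["hyp(100000001,100000002)."]

words, hypernyms = parse_wordnet_prolog(sample_s, sample_hyp)
expansion = hypernym_words("car", words, hypernyms)
```

The resulting hypernym words could then be added to a query in the same way as synonyms.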

5  Conclusion 

The ReSEARCH project is still in its beginning stages. However, we have made great strides
in identifying the fundamental issues involved in semantic search and how we will need to deal
with them in the context of SHARE. Our next step is to start experimentation with proxy
data (the Wikipedia data referred to above) and to plan how to move towards live SHARE
data as it becomes available. Another important aspect of the project that must be handled
next is what the summary field of the SHARE card catalog must contain.


The ReSEARCH Project: What we’re up to.


•  Using  open-sourced  components,  we  want  to  design  a  semantic 
search  engine  for  requirements  documents  that  supports  the 
SHARE  repository 

•  We  need  to  match  over  the  meaning  of  a  requirement,  not  a 
question  or  a  query  string. 

•  Do  processing  to  “enrich”  both  the  query  and  the  documents 
with  semantic  information. 


•  Automatically augment ontologies with new hypernymy (“is a”)
and mereology (“is a part of”) relations.


-  That is, do we have to be told that a Hummer is a vehicle, and one
of its parts is a steering wheel, or can we discover this from the
text?

•  Etc.


Why use semantic search?


•  Existing  keyword-based  search  engines  do 
not  take  into  account  the  semantics  of  the 
documents  they  are  searching. 


•  This  is  important  when  trying  to  find 
components  that  do  what  you  need,  not  what 
you type.


Why use semantic search?


•  The  query  string  and  the  desired  documents 
may  not  use  the  same  phrases. 

Q:  What  fuel  does  the  F-22A  consume? 

A:  The F-22A’s Raptor uses JP-8.


Query  and  answer  convey  same  meaning, 
but  use  different  forms 

-  Here,  “consume”  and  “uses”  are  synonymous. 


Acquisition Research Program: Creating Synergy for Informed Change
Naval Postgraduate School, Monterey, CA


Prior  and  current  strategies 


[Figure: Keyword Search Model. Left panel, “Synonymy Problem”: the terms auto, engine,
bonnet, tyres, lorry, and boot set against car, emissions, hood, make, model, and trunk.
Right panel, “Polysemy Problem”: the terms make, hidden Markov model, emissions, and
normalize.]


Prior and current strategies


•  Brin  and  Page  (1998)  revolutionized  search 
by  using  PageRank,  which  changes  the  order 
in  which  the  pages  that  match  the  keywords 
in  the  query  are  returned. 


•  The  essence  of  the  Google  innovation  is  in 
how the PageRank algorithm works.


PageRank algorithm


The  rank  of  a  particular  page  depends  on: 

•  The  number  of  pages  pointing  to  it, 

•  The  rank  of  each  page  pointing  to  it, 

•  The number of outgoing links on each of those pages.


PageRank algorithm

•  PageRank metric PR(P) defines recursively
the rank/importance of each page P by

$$PR(P) \propto \frac{PR(T_1)}{C(T_1)} + \frac{PR(T_2)}{C(T_2)} + \cdots + \frac{PR(T_n)}{C(T_n)}$$

where

•  T_1, T_2, ..., T_n are all the pages pointing to P

•  each T_i has C(T_i) outgoing links.
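
The recursive definition can be evaluated by repeated substitution. The sketch below is one common way to do this and is not Google's implementation: the damping factor (0.85, taken from the general PageRank literature rather than from this slide), the number of iterations, and the three-page toy web are assumptions, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page keeps a small base rank; the rest is passed along the links.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for source, targets in links.items():
            share = rank[source] / len(targets)      # PR(T_i) / C(T_i)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In the toy web, C receives links from both other pages and ends up with the highest rank, while B, which is linked only from A, ends up with the lowest.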
Random Surfer

To  further  determine  the  rank  of  all  web  pages 
Google  simulates  the  behavior  of  virtual 
surfers  randomly  surfing  the  web. 


A  page's  rank  is  then  updated  based  on  how 
frequently  the  random  surfers  visit  that  page. 


This pre-existing rank of each individual website
is assigned independently of any query.


Expert Rank


Ask.com  (Ask  Jeeves)  uses  the  ExpertRank 
algorithm: 

•  uses  the  number  of  incoming  links  as  well 

•  attempts  to  identify  topic  clusters  related  to 
search 


•  finds experts within these topics to “seed” the
rank of some websites as “expert” sites.

PageRank has the problem that “correct” is
not the same as “highly-ranked.”


Current Online Semantic Search


•  Powerset  Labs  has  emerged  as  a  forerunner 
in  online  semantic  search  using  natural 
language  to  extract  facts  from  text. 

•  On 11 May 08, Powerset’s search moved
from beta to a public release

•  Currently,  Powerset  searches  only  Wikipedia 
documents,  but  intends  to  expand  search  to 
the  Internet  in  the  future. 


Powerset  Indexing  System 


[Screenshot: Powerset article outline for “Jamestown, Virginia” with “Show Factz” enabled,
listing the following extracted “factz”:]

Virginia comprised Yorktown
Virginia comprised Williamsburg
Virginia comprised Triangle
Virginia comprised Jamestown
Jamestown offered area
Jamestown Yorktown Foundation operated site
Jamestown Yorktown Foundation operated colony
Nearby provided link
Nearby offered view
service offers view
site operates Colonial Williamsburg
Colonial Williamsburg Foundation operated Colonial Williamsburg
miles tour Colonial Williamsburg
East brought visitors


Powerset’s algorithms are not
publicly available, but their
behavior can be inferred from
publicly available demos

-  Powerset parses documents to
extract “factz”

-  “Factz” are generally triples of
subject-verb-object

-  Search is performed over these
“factz” rather than the full text


Question Answering


•  Keywords  such  as  “When”  tell  the  system  how  to  narrow  results 


•  “W”  words  such  as  “Who”  and  “When”  act  as  wildcards  for  matching 
“factz,”  allowing  many  searches  to  be  matched  exactly 
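
Since Powerset's algorithms are not public, the following is only a guess at the kind of matching these observations suggest: facts are stored as subject-verb-object triples, and a question word is treated as a wildcard slot. The `match_factz` function and the `ANY` marker are hypothetical; the triples are taken from the “Jamestown, Virginia” example.

```python
ANY = None  # wildcard standing in for a "W" word such as "Who" or "What"

FACTZ = [
    ("Virginia", "comprised", "Yorktown"),
    ("Virginia", "comprised", "Jamestown"),
    ("Jamestown", "offered", "area"),
]


def match_factz(pattern, factz):
    """Return every triple that agrees with the pattern in all non-wildcard slots."""
    def fits(wanted, actual):
        return wanted is ANY or wanted.lower() == actual.lower()
    return [fact for fact in factz if all(fits(w, a) for w, a in zip(pattern, fact))]


# "What comprised Jamestown?" becomes a pattern with a wildcard subject.
answers = match_factz((ANY, "comprised", "Jamestown"), FACTZ)
```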


Query: When did earthquakes hit San Francisco

Wikipedia Articles: results 1 - 10 of 1964

▼  1906 San Francisco earthquake: The San Francisco earthquake of 1906 was a
major earthquake that struck San Francisco and the coast of northern California at
5:12 A.M. on Wednesday, April 18 1906. ... There were decades of minor
earthquakes - more than at any other time in the historical record for northern
California - before the 1906 quake.

▼  San Francisco, California: At 5:12 am on April 18 1906 a major earthquake
struck San Francisco and Northern California. ... Minor earthquakes occur on a
regular basis.


Question Answering

•  Other  functional  words  such  as  “From”  in  the  search  “Politician  from 
Virginia”  improve  results  significantly  over  searching  on  just  the 
keywords  “Politician”  and  “Virginia” 


Query: Politician from Virginia

Wikipedia Articles: results 1 - 10 of 21172

▼  John S. Edwards (Virginia politician): Senator John S. Edwards (born 6
October 1943 (not to be confused with 2004 Vice-Presidential hopeful John Edwards)
is an American politician from Virginia

▼  William Christian (Virginia): William Christian (c. 1743 - 9 April 1786) was a
soldier and politician from Virginia who served in the era of the American
Revolution.

▼  James Hay (politician): James Hay was an American politician from Virginia

▼  William H. Roane: William Henry Roane (September 17, 1787 - May 11, 1845) was


Question Answering


•  Question  Answering  task  does  not  align 
exactly  with  requirements  document  search 

•  Requirements  documents  do  not  hold 
“Answers”  to  questions 


•  Encoding  of  facts  is,  however,  useful 


-  Computationally  less  demanding 

-  Efficient  use  of  storage  space  for  the  index 

-  Allows domain-specific constructs of facts to be
formulated and recognized in the corpus


Query Expansion with Synonymy

•  The  search  “What  do  zombies  eat?”  suggests  that  Powerset  searches 
for  synonyms  of  query  terms,  matching  “devour” 

•  Additionally,  stemming  matches  the  inverted  form  “eaten  by” 


▼  Dead Rising: Examples include a clown who became insane after seeing his
audience eaten, a manager of a food mart obsessed with keeping it clean and free of
vandalism, a deranged butcher who thinks zombies are "spoiled meat" and humans
are "fresh meat", and a Vietnam War veteran stuck in a war flashback after seeing
his granddaughter eaten by zombies.

▼  Characters in World War Z: He mentions several ways to tell a zombie apart from
a quisling, and the fact that early reports of zombies eating quislings led to a
misunderstanding that zombies fought amongst themselves and could be tricked into
killing each other.

▼  The Walking Dead: His corpse is thrown over the fence to be devoured by the
zombies, watched by Hershel. ... Zombies don't need their organs.


Query Expansion with Synonymy


•  Stemming  of  terms  is  found  in  most  search  engines 
and  is  fairly  easy  to  perform 

•  Matching  synonyms  allows  “close”  matches  on 
meaning  without  requiring  an  exact  keyword  match 

•  Using  a  structured  ontology,  expansion  is  not  limited 
to  synonyms  but  may  be  extended  to  hypernyms, 
hyponyms,  and  meronyms  as  well 

-  Ontology  based  query  expansion  does  not  appear  to  be 
used  in  current  Powerset  searches 


-  One  of  our  primary  approaches: 

•  Research  Question:  Can  we  automatically  augment  a  given 
ontology using the text of the documents?


Discovering Synonymous Sentences


•  Harris  (1954):  Synonymous  words  will  occur 
in  the  same  kinds  of  environments 

•  Lin  &  Pantel  (2001):  Synonymous  sentences 
will  contain  the  same  kinds  of  words 


The F-22A consumes JP-8
The F-22A’s engine uses JP-8

•  Idea: construct sentence similarity metric


Discovering Synonymous Sentences


•  Sentence similarity is the geometric average
of the similarity of the positions in the
sentence:

$$\mathrm{sim}(X_1 \text{ consumes } Y_1,\ X_2\text{'s engine uses } Y_2) = \sqrt{\mathrm{sim}(X_1, X_2) \times \mathrm{sim}(Y_1, Y_2)}$$

$$\mathrm{sim}(X_1\, p_1\, Y_1,\ X_2\, p_2\, Y_2) = \sqrt{\mathrm{sim}(X_1, X_2) \times \mathrm{sim}(Y_1, Y_2)}$$


Discovering Synonymous Sentences


•  Position similarity is a normalized sum of the
pointwise mutual information of all words that
appear in both positions of the respective
paths:

$$\mathrm{sim}(X_1, X_2) = \frac{\sum_{w \in T(p_1,s) \cap T(p_2,s)} \left( mi(p_1,s,w) + mi(p_2,s,w) \right)}{\sum_{w \in T(p_1,s)} mi(p_1,s,w) + \sum_{w \in T(p_2,s)} mi(p_2,s,w)}$$

$$mi(X_1 \ldots p \ldots Y_1,\ X_1,\ w) = \log \frac{f(p, X_1 = w)\; f(*, X_1 = *)}{f(p, X_1 = *)\; f(*, X_1 = w)}$$


Discovering Synonymous Sentences

•  Lin & Pantel evaluated system against TREC-8
Question Answering Task question set.


Query                          # Paths   Accuracy
X is author of Y                    21      52.5%
X is monetary value of Y             0      N/A
X manufactures Y                    37      92.5%
X spend Y                           16      40.0%
spend X on Y                        15      37.5%
X is managing director of Y         14      35.0%
X asks Y                            23      57.5%
asks X for Y                        14      35.0%
X asks for Y                        21      52.5%


The ReSEARCH Project: Work for us.


•  Using  open-sourced  components,  design  a  semantic  search 
engine  for  requirements  documents  that  supports  the  SHARE 
repository 

•  Match  over  the  meaning  of  a  requirement,  not  a  question. 

•  Do  processing  to  “enrich”  both  the  query  and  the  documents 
with  semantic  information. 

•  Automatically augment ontologies with new hypernymy (“is a”)
and mereology (“is a part of”) relations.

-  That is, do we have to be told that a Hummer is a vehicle, and one
of its parts is a steering wheel, or can we discover this from the
text?

•  Lots more!


Backup Slides


Building the Index


    <terminated> IndexWikipedia [Java Application]
    9901: Gladstone Gander
    10001: Guido Fawkes (disambiguation)
    10101: Herb
    10201: Foreign relations of Hungary
    10301: Homomorphism
    10401: Hildegard of Bingen
    10501: Henry Rollins
    10601: Head end
    10701: Horse Breed
    10801: Hausa language
    10901: Huey
    11001: Harry Potter and the Sorcerers Stone
    11101: Transport in India
    11201: MIRC
    11301: Illuminati: New World Order
    11401: Integer sequences
    11501: Iroquois
    11601: Genomic imprinting
    11701: Insanity defense
    11801: Imperial unit
    11901: Joseph Stalin
    12001: Judicial power
    12101: Java - Writing an Applet
    12201: Jackson, Michigan
    12301: Jupiter ACE
    12401: John George, Elector of Brandenburg
    12501: John Harrison
    12601: James R. Flynn
    12701: Politics of Kyrgyzstan
    Articles: 12738
    Optimizing ...
    Time: 234 Seconds


A Sample Search with Lucene


    >>> def searchWikipedia(queryString):
            # parser is the QueryParser and searcher the index searcher set up beforehand
            query = parser.parse(queryString)
            hits = searcher.search(query)
            print "Hits: ", hits.length()
            for i in range(0, hits.length()):
                doc = hits.doc(i)
                title = doc.get("title")
                print i, ": ", title, "score: ", hits.score(i)

    >>> searchWikipedia("scream AND munch")
    Hits:  6
    0 :  Edvard Munch score:  0.999999940395
    1 :  Afterglow score:  0.4743026793
    2 :  Angst score:  0.23715133965
    3 :  Fear score:  0.142290815711
    4 :  August 31 score:  0.118575669825
    5 :  August 22 score:  0.117710016668
    >>>


Using an Augmented Search String

    >>> searchWikipedia("Relativity")
    Hits:  106
    0 :  General Relativity score:  1.0
    >>> searchWikipedia("text:relativity text:einstein")
    Hits:  206
    0 :  Albert Einstein score:  1.0
    1 :  Inertial frame of reference score:  0.812765300274
    2 :  Gravitational redshift score:  0.801645994136
    3 :  General Relativity score:  0.737773377533
    4 :  General relativity score:  0.73440735403
    5 :  Acceleration score:  0.636109173293
    6 :  Arthur Stanley Eddington score:  0.603332340717
    7 :  Cosmic censorship hypothesis score:  0.578752375323
    8 :  Graviton score:  0.577327572323
    9 :  Einstein score:  0.574990303964
    10 :  Faster-than-light score:  0.571375691414
    >>>


Our Work


GOAL:  Design  an  alternative  method  to 
explicitly  store/represent  semantic  metadata 
in order to enable semantic search.